A little overview

The miniature dataset ERR provided consists of a few music-related shows and 50 news shows of Päevakaja. I’ll provide a quick overview of the latter, more consistent set. There’s not much data in the XML files. There are tags for the date of recording, topics (fortunately these seem to be standardized), and a general “contents” (sisu) tag. The contents tag usually starts with the name of the editor/speaker (toimetaja), which is usually delimited from the rest by a *, but this is not consistent, nor are the naming conventions, so getting names out is a bit tricky. The contents tag sometimes holds a single sentence, sometimes a paragraph that seems to be cut off; sometimes it also has more names (reporters?) and references to music tracks.
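To make the name extraction problem concrete, here is a minimal sketch of how one record might be parsed. Only the `sisu` tag name comes from the data described above; the other tag names (`saade`, `kuupaev`, `teema`), the sample record, and the placeholder name are assumptions made up for illustration, as is the word-count heuristic for deciding whether the text before the * looks like a name.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: every tag name except "sisu" is an assumption,
# and the content is invented for illustration.
SAMPLE = """<saade>
  <kuupaev>2015-03-02</kuupaev>
  <teema>poliitika</teema>
  <teema>majandus</teema>
  <sisu>Mari Maasikas * Valitsus arutas eelarvet.</sisu>
</saade>"""


def split_editor(sisu, max_name_words=4):
    """Split 'Name * rest' into (name, rest).

    Returns (None, text) when there is no asterisk or when the part
    before it is too long to plausibly be a name (a crude heuristic).
    """
    head, sep, tail = sisu.partition("*")
    name = head.strip()
    if not sep or not name or len(name.split()) > max_name_words:
        return None, sisu.strip()
    return name, tail.strip()


def parse_record(xml_text):
    """Pull date, topics, editor and remaining text out of one record."""
    root = ET.fromstring(xml_text)
    editor, text = split_editor(root.findtext("sisu", default=""))
    return {
        "date": root.findtext("kuupaev"),
        "topics": [t.text.strip() for t in root.findall("teema") if t.text],
        "editor": editor,
        "text": text,
    }


record = parse_record(SAMPLE)
```

Records where the delimiter is missing come back with `editor` set to `None`, so they can be counted and checked by hand rather than silently getting a wrong name.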


Co-occurrences of topics and editors

Hover over cells to see more info:


The same in a network format. I removed the links between topics, otherwise it’s just a hairball. This plot is interactive: zoom in and click on the nodes to see connections, drag to move:
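For reference, the counts behind both the grid and the network could be computed roughly like this. This is a sketch under assumptions, not the code used for the plots: the `shows` list is invented, and it assumes each show has already been reduced to one editor plus a list of topic tags.

```python
from collections import Counter

# Hypothetical parsed shows as (editor, topics) pairs; invented data.
shows = [
    ("Editor A", ["politics", "economy"]),
    ("Editor B", ["politics"]),
    ("Editor A", ["politics", "culture"]),
    (None, ["culture"]),  # editor could not be extracted
]


def cooccurrence(shows):
    """Count in how many shows each (editor, topic) pair appears together."""
    counts = Counter()
    for editor, topics in shows:
        if editor is None:
            continue  # skip shows without a usable editor name
        for topic in set(topics):  # a repeated tag counts once per show
            counts[(editor, topic)] += 1
    return counts


counts = cooccurrence(shows)

# Weighted edge list with editor-topic links only. Leaving out
# topic-topic links is what keeps the network from becoming a hairball.
edges = [(editor, topic, n) for (editor, topic), n in sorted(counts.items())]
```

The same `counts` can fill the cells of the grid (editors as rows, topics as columns), while `edges` can be handed to whatever graph library draws the network.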